Counting Positives Accurately Despite Inaccurate Classification
Abstract
Most supervised machine learning research assumes that the training set is a random sample from the target population, and thus that the class distribution is invariant. In real-world situations, however, the class distribution changes, and this change is known to erode the effectiveness of classifiers and calibrated probability estimators. This paper focuses on the problem of accurately estimating the number of positives in the test set (quantification), as opposed to classifying individual cases accurately. It compares three methods: classify & count, an adjusted variant, and a mixture model. An empirical evaluation on a text classification benchmark reveals that the simple method is consistently biased, and that the mixture model is surprisingly effective even when positives are very scarce in the training set, a common case in information retrieval.

1 Motivation and Scope

We address the problem of estimating the number of positives in a target population, given a training set from which to learn to distinguish positives from negatives. This could be used, for example, to estimate the number of news articles about terrorism each month, or the volume of advertising by a competitor over time. Unlike previous literature in machine learning, our end goal is not to determine a classification for each item, but only to estimate the number of positives: quantification as opposed to classification. This is an important problem in real-world situations where the ...

[Fig. 1]
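To make the distinction between the first two methods named in the abstract concrete, here is a minimal Python sketch. It rests on assumptions not stated in the text above: that "classify & count" reports the fraction of test items a classifier labels positive, and that the "adjusted variant" corrects that fraction using the classifier's true positive rate and false positive rate (for instance, estimated by cross-validation on the training set). The function names and the clipping step are illustrative only; the mixture model is not sketched here.

```python
def classify_and_count(predictions):
    """Return the fraction of items the classifier labelled positive.

    `predictions` is a sequence of 0/1 labels produced by some classifier
    (hypothetical input; any binary classifier would do).
    """
    return sum(predictions) / len(predictions)


def adjusted_count(predictions, tpr, fpr):
    """Correct the classify-and-count estimate for known classifier error.

    Assumption: the observed positive rate satisfies
        observed = tpr * p + fpr * (1 - p)
    where p is the true fraction of positives, so
        p = (observed - fpr) / (tpr - fpr).
    `tpr` and `fpr` are assumed to be estimated elsewhere.
    """
    observed = classify_and_count(predictions)
    if tpr <= fpr:
        # The correction is undefined for a classifier no better than chance;
        # fall back to the uncorrected estimate.
        return observed
    estimate = (observed - fpr) / (tpr - fpr)
    # Clip to the valid range, since noisy rate estimates can overshoot.
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    # Hypothetical test set: 30 of 100 items were labelled positive.
    predictions = [1] * 30 + [0] * 70
    print(classify_and_count(predictions))
    print(adjusted_count(predictions, tpr=0.8, fpr=0.1))
```

Under these assumptions, a classifier that misses some positives and raises some false alarms produces a raw count that differs from the corrected one, which illustrates how a count can be biased even when individual classifications are mostly right.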
Publication date: 2005